Abstract: We consider a large-scale cyber network with N
components (e.g., paths, servers, subnets). Each component is
either in a healthy state (0) or an abnormal state (1). Due to
random intrusions, the state of each component transits from
0 to 1 over time according to a certain stochastic process. At each
time, a subset of K (K < N) components are checked and those
observed in abnormal states are fixed. The objective is to design 
the optimal scheduling for intrusion detection such that the long- 
term network cost incurred by all abnormal components is min- 
imized. We formulate the problem as a special class of Restless 
Multi-Armed Bandit (RMAB) process. A general RMAB suffers 
from the curse of dimensionality (PSPACE-hard) and numerical 
methods are often inapplicable. We show that, for this class of 
RMAB, Whittle index exists and can be obtained in closed form, 
leading to a low-complexity implementation of Whittle index 
policy with a strong performance. For homogeneous components, 
Whittle index policy is shown to have a simple structure that 
does not require any prior knowledge on the intrusion processes. 
Based on this structure, Whittle index policy is further shown to 
be optimal over a finite time horizon with an arbitrary length. 
Beyond intrusion detection, these results also find applications in 
queuing networks with finite-size buffers. 

I. Introduction 

The objective of Intrusion Detection Systems (IDS) is to 
locate malicious activities (e.g., denial-of-service attacks, port
scans, hackers) in the quickest way such that the infected parts
can be timely fixed to minimize the overall damage to the net- 
work. With the increasing size, diversity, and interconnectivity 
of the cyber system, however, intrusion detection faces the 
challenge of scalability: how to rapidly locate intrusions and 
anomalies in a large dynamic network with limited resources. 
The two basic approaches to intrusion detection, namely, 
active probing and passive monitoring [1], [2], face stringent
resource constraints when the network is large and dynamic. 
Specifically, active-probing based approaches need to choose 
judiciously which components of the network to probe to 
reduce overhead; passive-monitoring based approaches need 
to determine how to sample the network so that real-time 
processing of the resulting data is within the computational 
capacity of the IDS [3]. The problem is compounded by the
fact that the adversarial behaviors are typically random and 
evolving. 

In this paper, we address resource-constrained intrusion 
detection in large dynamic cyber networks. Specifically, we 
consider a network with N heterogeneous components which
can be paths, routers, or subnets. At a given time, a component
can be in a healthy state or an abnormal state. An abnormal
component remains abnormal until the anomaly is detected and
resolved. A healthy component may be attacked and become 
abnormal if the attack is successful. We consider a general 
attack model: the behavior of the intruder can be arbitrarily 
correlated in time and varies across components, and different 
attacks can be launched with different probabilities of suc- 
cessfully compromising the component under attack. As a 
consequence, the state of a component evolves according to an 
arbitrary stochastic process until it is probed/sampled. When a 
healthy component is probed/sampled, its state evolution (i.e.,
how likely it will become abnormal in each subsequent time 
instant) is reset. This models the scenario where proactive ac- 
tions are taken (patches are installed, firewalls upgraded, etc.) 
by the IDS when probing/sampling a component to refresh 
its immunity to attacks. Note that this model differs significantly
from, and is more complicated than, the SIS (susceptible-
infected-susceptible) model and its variants (see, e.g., [4]).

For each component in an abnormal state, a cost (depending 
on the criticality of the component) per unit time is incurred. 
At each time, the IDS can choose a subset of K components
to probe or sample (K is often much smaller than N due to
resource constraints). The question here is how to dynamically
probe or sample these N components to minimize the long-
term cost over time. The key is to learn from past observations
and decisions and dynamically adjust the probing/sampling 
actions. 

A. Main Results 

We formulate the dynamic intrusion detection problem as
a special class of Restless Multi-Armed Bandit (RMAB)
process, where each component is considered as an arm. While
finding the optimal solution to a general RMAB problem is
PSPACE-hard with exponential complexity in system size [5],
we show that for the class of RMAB at hand, several
structural properties exist that lead to simple robust solutions.
Specifically, by exploiting the reset nature of the problem, we
first show that a sufficient statistic for choosing the optimal
probing/sampling actions is given by a two-dimensional vector
for each arm that can be easily updated at each time. This
significantly reduces the state space for optimal decision
making. Second, we show that this RMAB is indexable; thus
an index policy, referred to as Whittle index policy, with
strong performance and linear complexity in the size N of
the cyber network can be constructed. Third, we show that
the Whittle index can be obtained in closed form, leading
to negligible complexity of implementation. Fourth, we show
that for homogeneous components, the low-complexity Whittle
index policy has a simple robust structure that does not
need any prior knowledge of the stochastic attack model and
achieves the optimal performance.

In the context of RMAB, our results contribute to the study 
of the existence and optimality of Whittle index policy. In 
1988, Whittle generalized the classic MAB to RMAB, a more
powerful stochastic model that takes into account system dynamics
that cannot be directly controlled [6]. Whittle proposed an
index policy that has been shown to be asymptotically (when
the system size approaches infinity) optimal under certain
conditions [7], [8]. The difficulty of Whittle index policy
lies in the complexity of establishing its existence (the so- 
called indexability) and computing the index. There is no 
general characterization regarding which class of RMAB is 
indexable, and little is known about the optimality of Whittle 
index (when it does exist) for finite-size systems. In this 
paper, we present a significant class of indexable RMAB 
with practical applications for which Whittle index policy 
is shown to be optimal for homogeneous arms. This result 
lends a strong justification for the existence and the optimality 
of linear complexity algorithms based on the Whittle index. 
Beyond intrusion detection, this special class of RMAB and
the corresponding results can also be applied to the holding
cost minimization problem in queuing networks with finite-
size buffers, as elaborated in Sec. VII.

B. Related Work 

In [9], the problem of intrusion recognition by classifying
system patterns was addressed based on data mining. Without
a resource constraint, the focus is on the best selection of system
features to detect intrusion from the accessible system data
statistics. Similar problems of statistical modeling of data and
detection algorithms under various scenarios were considered
in a number of papers, e.g., [10]-[13]. These studies mainly
address the intrusion detection problem from a machine learning
or pattern recognition perspective and do not consider the
constraint on the system monitoring capacity. Our work is
a stochastic control approach to intrusion detection in large
networks with resource constraints, where the problem of how
to adaptively allocate the limited detection and repair power
for performance optimization is of great interest. In [14], a
set of heuristic detection, path selection, and link anomaly
localization algorithms were proposed based on active
probe-enabled network measurements. In [15], the intrusion
detection problem was formulated as a zero-sum game with
two players (the intruder and the IDS), where the game evolutions
and outcomes were studied through numerical examples
based on Markovian decision processes and Q-learning. The
previous algorithm designs mainly take into account static
or Markovian dynamics of the networks. The results in this
paper thus represent a step forward over the previous work by
addressing general non-Markovian network dynamics.

In the literature of RMAB, the indexability was studied
in [16], where efficient algorithms were constructed to numerically
test indexability and compute Whittle index for finite-
state systems. For the problem at hand, the system state space
is infinite, and thus numerical methods are generally infeasible,
even for a fixed realization of system parameters. We show,
however, that indexability holds regardless of the system
parameters and that Whittle index can be solved in closed form.
The optimality of Whittle index policy is subsequently established
for homogeneous arms. For a special class of RMAB as
detailed in the next paragraph, the optimality of Whittle index
policy was established for homogeneous arms under certain
conditions. In general, the optimality of Whittle index policy
has rarely been established. Nevertheless, numerical studies
have demonstrated the near-optimality of Whittle index policy
for numerous RMAB models (see, e.g., [17]-[20]).

In the context of dynamic spectrum access and multi-
agent tracking systems, a class of RMAB modeled by a two-
state Markovian model was considered in [21], [22]. The
indexability was established and Whittle index was solved in
closed form. The Markovian model yields special structural
properties of the system dynamic equations that significantly
simplify the establishment of the indexability and Whittle
index. However, these structural properties no longer hold
for the RMAB considered here, which deals with arbitrary
underlying random processes, and the approaches in [21], [22]
do not apply. In this paper, we propose a new approach for
establishing the indexability and the closed-form Whittle index
based on a comparison argument on the optimal stopping times.
Besides the RMAB model at hand, this approach is extendable
to general two-state reset processes with partially observable
states. In [22], Whittle index policy was shown to be equivalent
to the myopic policy for homogeneous arms, which leads to
its optimality under certain conditions based on the previous
results on the myopic policy established in [23]-[25]. Again,
the approaches in [23]-[25] are based on the special properties,
e.g., the linearity of the value function, of the myopic policy
under the Markovian model. For the problem at hand, although
the equivalence between Whittle index policy and the myopic
policy is preserved for homogeneous arms, the properties
under the Markovian model no longer hold. To show the
optimality, we take a different approach by establishing the
monotonicity of the value function, as detailed in Sec. V.

II. Network Model 

Consider a cyber network with N inhomogeneous compo- 
nents that are subject to random attacks over time. At each 
discrete time, each component is either in the healthy state (0) 
or the abnormal state (1). If an attack on a healthy component
is successful, the component enters the abnormal state until it 
is probed and fixed. We assume that different components ex- 
perience statistically independent but not necessarily identical 
attack processes. 

Each attack process can be arbitrarily correlated over time.
Consequently, the state evolution of a component is given by
an arbitrary probability sequence $\{p_n(t)\}_{t\ge 0}$, where $p_n(t)$ is
the probability that component $n$ has entered state 1 within $t$ steps
since the last time it was probed. Specifically, if a component
(say, component $n$) is probed and observed in state 0, a simple
maintenance action is taken which resets its state evolution
according to $\{p_n(t)\}_{t\ge 0}$. If component $n$ is observed in state 1,
a sophisticated repair action is taken, and the component will
be back to the normal state in the next time instant$^1$ and then
evolve according to $\{p_n(t)\}_{t\ge 0}$. Note that $\{p_n(t)\}_{t\ge 0}$ is a
monotonically increasing sequence since state 1 is absorbing
when the component is unobserved. A simple example is
given by the i.i.d. attack process, where component $n$ is
compromised with a constant probability $q_n \in (0,1)$ at each
time. For this example, the state of component $n$ transits as a
Markov chain shown in Fig. 1 and we have

$$p_n(t) = 1-(1-q_n)^t,$$

which monotonically converges to 1 at the geometric rate $(1-q_n)$
as $t$ increases. In general, we do not require any specific
form of $\{p_n(t)\}_{t\ge 0}$.

Fig. 1. An example based on the Markovian state model.
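As a small illustration of the i.i.d. attack example above, the following sketch tabulates the sequence $p_n(t) = 1-(1-q_n)^t$. The function name and the sample value of $q_n$ are assumptions made for illustration only; they do not come from the paper.

```python
from typing import List


def geometric_attack_sequence(q: float, horizon: int) -> List[float]:
    """Return [p(0), p(1), ..., p(horizon)] with p(t) = 1 - (1 - q)**t.

    q is the per-slot compromise probability of the i.i.d. attack example;
    the function name and interface are illustrative assumptions.
    """
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie in (0, 1)")
    return [1.0 - (1.0 - q) ** t for t in range(horizon + 1)]


# Hypothetical value q = 0.1 for a single component.
p = geometric_attack_sequence(q=0.1, horizon=5)
increments = [b - a for a, b in zip(p, p[1:])]
```

Here `p[0]` is zero (a freshly repaired or maintained component is healthy), the sequence increases toward 1, and the increments shrink geometrically.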

For each abnormal component (say, component $n$), a cost
$c_n$ is incurred per unit time. With limited resources, only
a subset of $K$ ($K < N$) components can be probed for
maintenance/repair. The objective is to minimize the long-
term average network cost by designing the optimal sequential
component probing policy.

III. RMAB Formulation 

In this section, we formulate the intrusion detection problem 
as a special class of Restless Multi-Armed Bandit (RMAB) 
process. The concepts of indexability and Whittle index are 
also introduced. 

A. RMAB and Sufficient Statistics 

In a general RMAB, a player chooses K out of N indepen- 
dent arms to activate at each time based on the current states of 
all arms. At each time, the state of each arm transits according 
to two potentially different Markovian rules depending on 
whether it is made active or passive. Each arm contributes 
an immediate reward depending on its current state and the 
imposed action. The objective is to maximize the long-term 
reward by optimally selecting arms to activate over time based 
on the arm state evolutions. 

Note that in an RMAB the states of all arms are assumed to
be completely observable and to obey Markovian transition rules.
However, for the intrusion detection problem at
hand, the state (0/1) of each component is not observable
unless it is probed, and the state transition rules are non-
Markovian in general. It is thus not suitable to model the
component state as the arm state. By exploiting the reset nature
of the problem, we show in the next lemma that a sufficient
statistic for optimal decision making is given by the two-
dimensional vector set $\{(i_n,t_n)\}_{n=1}^N$, where $i_n \in \{0,1\}$ is
the last observed state of component $n$ and $t_n$ the time elapsed
since the last observation. As a consequence, we can treat
$(i_n,t_n)$ as the arm state of component $n$, which is completely
observable but has an infinite state space. In the rest of the paper,
we refer to $(i_n,t_n)$ as the arm state of component $n$ to
distinguish it from the component state $S_n \in \{0,1\}$. We also
let $a_n \in \{0,1\}$ (1: active/probe; 0: passive/not probe) denote the
probing action on arm $n$.

$^1$ Parallel results can be obtained for the model in which a repaired
component cannot be guaranteed to be healthy in the next time instant;
they are omitted here due to the space limit.

Lemma 1: For the intrusion detection problem, the vector
set $\{(i_n,t_n)\}_{n=1}^N$ is a sufficient statistic for optimal decision
making. Furthermore, given the current probing actions and
observations, the arm state $(i_n,t_n)$ of component $n$ transits
according to the following Markovian rule:

$$\mathcal{T}(i_n,t_n) = \begin{cases} (0,1), & \text{if } a_n = 1,\ S_n = 0,\\ (1,1), & \text{if } a_n = 1,\ S_n = 1,\\ (i_n,t_n+1), & \text{if } a_n = 0, \end{cases}$$

where $\mathcal{T}(\cdot)$ denotes the one-step transition of the arm state
given the current arm state and action.

Proof: Recall that each active action on a component
(say, component $n$) resets its state evolution according to the
probability sequence $\{p_n(t)\}_{t\ge 0}$ (see Sec. II). Given $(i_n,t_n)$,
the future state statistics of component $n$ are independent of
previous actions and observations. The vector set $\{(i_n,t_n)\}_{n=1}^N$ is
thus a sufficient statistic. The one-step update of $\{(i_n,t_n)\}_{n=1}^N$
is straightforward. ■
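The one-step update in Lemma 1 can be sketched as a small function. This is a minimal illustration under the assumption that an arm state is stored as a pair `(i, t)`; the names `ArmState` and `next_arm_state` are hypothetical.

```python
from typing import Optional, Tuple

ArmState = Tuple[int, int]  # (last observed state i, time elapsed t since that observation)


def next_arm_state(state: ArmState, action: int, observed: Optional[int] = None) -> ArmState:
    """One-step arm-state transition of Lemma 1.

    action: 1 if the component is probed in this slot, 0 otherwise.
    observed: the component state S_n seen by the probe (required when action == 1).
    """
    i, t = state
    if action == 1:
        if observed not in (0, 1):
            raise ValueError("a probe must report an observed state of 0 or 1")
        return (observed, 1)   # the clock restarts after every probe
    return (i, t + 1)          # unprobed: only the elapsed time grows


after_probe_bad = next_arm_state((0, 3), action=1, observed=1)
after_probe_ok = next_arm_state((0, 3), action=1, observed=0)
after_idle = next_arm_state((1, 2), action=0)
```

Because the update depends only on the current pair, the action, and the observation, the arm state evolves as a Markov chain even though the underlying attack process need not be Markovian.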
Now we complete the RMAB formulation of the intrusion
detection problem by observing that the immediate reward
$R_n(S_n)$ offered by component $n$ can be modeled by $-c_n$
if it is currently in the abnormal state and 0 otherwise.
Consequently, the reward maximization is equivalent to the
cost minimization. In the rest of the paper, we use RMAB-
IDS to denote this class of RMAB.

B. The Optimality Equation 

In this subsection, we establish the optimality equation for
RMAB-IDS. We consider the following strong average-reward
criterion under which not only the steady-state average reward
but also the transient reward starting from an arbitrary initial
arm state is maximized, leading to the maximum long-term
total reward growth rate:

$$G + F(\{(i_n,t_n)\}_{n=1}^N) = \max_{A}\, \mathbb{E}_A\Big\{\sum_{n=1}^N R_n(S_n) + F(\{\mathcal{T}(i_n,t_n \mid a_n,S_n)\}_{n=1}^N)\Big\}, \tag{1}$$

where $A = \{a_n\}_{n=1}^N$ with $\sum_{n=1}^N a_n = K$ denotes the current
probing actions, $G$ the maximum steady-state average reward
over the infinite horizon, $F(\cdot)$ the transient reward starting
from the initial arm states, and $\mathbb{E}_A[\cdot]$ the expectation operator
given $A$. Solving the optimality equation (1) suffers from the
curse of dimensionality and has an exponential complexity for
dynamic programming. In Sec. IV, we show that for RMAB-
IDS, the linear-complexity Whittle index policy exists and can
be obtained in closed form with a near-optimal performance.

C. Definition of Whittle Index Policy 

The key idea of Whittle index policy is to provide a subsidy 
for passivity to measure the attractiveness of activating an arm 
based on its current state. Based on the strong decomposability
of Whittle index, it is sufficient to focus on each single arm [6].

1) Single-Armed Bandit with Subsidy: Consider the single-
armed bandit for the intrusion detection problem with only
one arm/component. At each time instant, we decide whether
to activate the arm or make it passive. Assume that a subsidy
for passivity, denoted by $\lambda$, is gained whenever the arm is
made passive. We have the following optimality equations.
For simplicity of presentation, we drop the component
index from the notation.

$$\begin{aligned} g + f(0,t) &= \max\{\lambda - p(t)c + f(0,t+1),\ -p(t)c + p(t)f(1,1) + (1-p(t))f(0,1)\} \\ &= -p(t)c + \max\{\lambda + f(0,t+1),\ p(t)f(1,1) + (1-p(t))f(0,1)\}, \end{aligned} \tag{2}$$

$$g + f(1,t) = -p(t-1)c + \max\{\lambda + f(1,t+1),\ p(t-1)f(1,1) + (1-p(t-1))f(0,1)\}, \tag{3}$$

where $g$ and $f(\cdot)$ denote, respectively, the maximum steady-
state average reward and the transient reward by playing the
single arm. The optimal policy for this single-arm problem is
essentially given by an optimal partition of the arm state space
$\bigcup_{i=0,1}\{(i,t)\}_{t\ge 1}$ into a passive set

$$\mathcal{P}(\lambda) = \{(i,t) : a^*(i,t,\lambda) = 0\} = \{(i,t) : \lambda + f(i,t+1) > p(t-i)f(1,1) + (1-p(t-i))f(0,1)\}$$

and its complement, an active set $\mathcal{A}(\lambda) = \{(i,t) : a^*(i,t,\lambda) = 1\}$,
where $a^*(i,t,\lambda)$ denotes the optimal
action at arm state $(i,t)$ under subsidy $\lambda$.

2) Indexability and Whittle Index: To define Whittle index
policy, it is required that the RMAB is indexable [6].

Definition 1: An RMAB is indexable if for each arm, the
passive set $\mathcal{P}(\lambda)$ increases monotonically from the empty set
$\emptyset$ to the entire state space $\bigcup_{i=0,1}\{(i,t)\}_{t\ge 1}$ as the subsidy $\lambda$
increases from $-\infty$ to $+\infty$. An RMAB is strictly indexable if
the states join the passive set one by one (instead of in groups)
as $\lambda$ continuously increases.

Given the indexability, the Whittle index $W(i,t)$ of an arm
state $(i,t)$ is defined as the infimum subsidy $\lambda$ that makes the
passive action optimal at $(i,t)$:

$$W(i,t) = \inf\{\lambda : a^*(i,t,\lambda) = 0\} = \inf\{\lambda : \lambda + f(i,t+1) > p(t-i)f(1,1) + (1-p(t-i))f(0,1)\}.$$



Whittle index essentially measures how attractive it is to
activate an arm in terms of the subsidy $\lambda$: the minimum subsidy
that is needed to move an arm state from the active set to
the passive set under the optimal partition measures how
attractive this arm state is.

Whittle index policy is naturally given by playing the K 
arms with the largest Whittle indexes. 
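The selection rule itself is a top-$K$ choice over per-arm indexes. The sketch below assumes each arm supplies its own index function of the arm state $(i, t)$; the function `whittle_index_policy` and the toy index used in the example are hypothetical placeholders, not the closed-form index of RMAB-IDS.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

ArmState = Tuple[int, int]               # (last observed state i, elapsed time t)
IndexFn = Callable[[int, int], float]    # maps (i, t) to an index value


def whittle_index_policy(arm_states: Sequence[ArmState],
                         index_fns: Sequence[IndexFn],
                         k: int) -> List[int]:
    """Return the positions of the k arms with the largest index values."""
    indexes = [fn(i, t) for fn, (i, t) in zip(index_fns, arm_states)]
    chosen = heapq.nlargest(k, range(len(indexes)), key=indexes.__getitem__)
    return sorted(chosen)


def toy_index(i: int, t: int) -> float:
    """Placeholder index for illustration only (grows with elapsed time)."""
    return float(t - i)


states = [(0, 1), (0, 5), (1, 3), (0, 4)]
probed = whittle_index_policy(states, [toy_index] * len(states), k=2)
```

Since each index depends only on its own arm, the per-slot work is one index evaluation per arm plus the selection, which is the source of the linear complexity in the network size.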

IV. Indexability and the Closed-Form Whittle 
Index for RMAB-IDS 

In this section, we establish the indexability of RMAB-IDS 
and solve for Whittle index in closed form. Based on the 
indexability and Whittle index, we study the optimal policy 
for RMAB-IDS under a relaxed constraint. 

A. Indexability 

Theorem 1: RMAB-IDS is indexable.
Proof: Consider the single-armed bandit with subsidy.
Without loss of generality, we assume that the cost $c = 1$.
Define the stopping time $t_i$ as the number of steps until the
first activation after observing the arm in component state
$i \in \{0,1\}$. We can rewrite the dynamic equations (2) and (3)
as follows.

$$f(0) = \max_{t_0 \ge 1}\Big\{-g t_0 + \lambda(t_0 - 1) - \sum_{k=1}^{t_0} p(k) + p(t_0)f(1) + (1-p(t_0))f(0)\Big\},$$

$$f(1) = \max_{t_1 \ge 1}\Big\{-g t_1 + \lambda(t_1 - 1) - \sum_{k=1}^{t_1} p(k-1) + p(t_1-1)f(1) + (1-p(t_1-1))f(0)\Big\},$$

where $f(i)$ ($i \in \{0,1\}$) is the transient reward starting from
arm state $(i,0)$. Note that we can set $f(0) = 0$ since only
$f(1) - f(0)$ is determined by the above equations. We thus
have

$$0 = \max_{t_0 \ge 1}\Big\{-g t_0 + \lambda(t_0-1) - \sum_{k=1}^{t_0} p(k) + p(t_0) f(1)\Big\}, \tag{4}$$

$$f(1) = \max_{t_1 \ge 1}\Big\{-g t_1 + \lambda(t_1-1) - \sum_{k=1}^{t_1} p(k-1) + p(t_1-1) f(1)\Big\}. \tag{5}$$
To prove indexability, it is equivalent to prove that the
optimal $\{t_i^*\}_{i=0,1}$ in (4) and (5) are nondecreasing in $\lambda$. For
the case $\lambda < 0$, all states are in the active set, i.e., $t_i^* = 1$
for $i \in \{0,1\}$. This is because both the fraction of time
that the component spends in the abnormal state and the passive
time are minimized by always activating the arm.

Consider the case $\lambda \ge 0$. We should always make
the arm passive if the observation of the component state
in the previous slot is 1, since the current component state
is guaranteed to be 0 after repair and there is no benefit in
observing it again. Consequently, $t_1^* > 1$. Combined with (4)
and (5), we further observe that $t_1^* = t_0^* + 1$. Note that this
holds not only for the optimal stopping times $\{t_i^*\}_{i=0,1}$ but
also for all stationary policies with $t_1 > 1$. By considering $t_0^*$
in (4) and (5), we can solve for $f(1)$ and $g$ and obtain

$$g = \frac{\lambda\big(t_0^* - 1 + p(t_0^*)\big) - \sum_{k=1}^{t_0^*} p(k)}{t_0^* + p(t_0^*)}. \tag{6}$$

Now suppose that it is better to activate the arm at the $t$-th
step instead of any earlier step after observing component
state 0. We have

$$\frac{\lambda\big(t-1+p(t)\big) - \sum_{k=1}^{t} p(k)}{t+p(t)} \ge \frac{\lambda\big(s-1+p(s)\big) - \sum_{k=1}^{s} p(k)}{s+p(s)}. \tag{7}$$

We can further simplify (7) and obtain, for all $s \in \{1,\dots,t\}$,

$$\lambda\big(t - s + p(t) - p(s)\big) \ge \sum_{k=1}^{t} p(k)\,\big(s+p(s)\big) - \sum_{k=1}^{s} p(k)\,\big(t+p(t)\big). \tag{8}$$

Based on the monotonicity of $\{p(t)\}_{t\ge 0}$, we have
$t - s + p(t) - p(s) \ge 0$, and (8) remains true as $\lambda$ ($\lambda \ge 0$) increases.
Equivalently, the set of $t$ for which (7) and (8) are true is
nondecreasing in $\lambda$. We thus conclude that $\{t_i^*(\lambda)\}_{i=0,1}$ are
nondecreasing in $\lambda$. Since this further implies that $\mathcal{P}(\lambda)$ is
nondecreasing in $\lambda$, we have proved the indexability. ■
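The monotonicity used in the proof can be checked numerically on an example. The sketch below evaluates the right-hand side of (6) for each candidate waiting time and confirms that the maximizing waiting time does not decrease as the subsidy grows. It assumes the i.i.d. attack example with a hypothetical $q = 0.2$, restricts attention to $\lambda \ge 0$, scales the cost term by $c$, and truncates the waiting time at the length of the stored sequence; all function names are illustrative.

```python
from typing import List


def average_reward(lam: float, t: int, p: List[float], c: float = 1.0) -> float:
    """Right-hand side of (6) for waiting time t_0 = t, with the cost term scaled by c.

    p[k] holds p(k), with p[0] = 0.
    """
    return (lam * (t - 1 + p[t]) - c * sum(p[1:t + 1])) / (t + p[t])


def optimal_wait(lam: float, p: List[float], c: float = 1.0) -> int:
    """Smallest t in {1, ..., len(p) - 1} that maximizes (6).

    Capping t at len(p) - 1 is a truncation assumption of this sketch.
    """
    candidates = range(1, len(p))
    return max(candidates, key=lambda t: (average_reward(lam, t, p, c), -t))


# Hypothetical example: i.i.d. attacks with q = 0.2, so p(t) = 1 - 0.8**t.
p = [1.0 - 0.8 ** t for t in range(61)]
subsidies = [0.25 * j for j in range(21)]          # lambda = 0, 0.25, ..., 5
waits = [optimal_wait(lam, p) for lam in subsidies]
```

For each fixed waiting time the expression in (6) is affine in $\lambda$ with a slope that grows with the waiting time, which is why the maximizer can only move to later steps as $\lambda$ increases.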

B. The Closed-Form Whittle Index 

Given the indexability established in Sec. IV-A, we proceed
to solve for the closed-form Whittle index of RMAB-IDS.
For simplicity of presentation, we focus on the case that the
bandit is strictly indexable (see Definition 1), i.e., there is
no tie among the Whittle indexes. The following simple
condition is adopted to guarantee the strict indexability.
C1: $p(t+1) - p(t)$ is strictly decreasing in $t$.

Note that C1 is always satisfied under the Markovian state
model (see Sec. II). As shown in the following theorem, under
C1, RMAB-IDS is strictly indexable. The closed-form Whittle
index function is subsequently obtained.

Theorem 2: Under C1, RMAB-IDS is strictly indexable and
the Whittle index $W(\cdot)$ is given below.

$$W(0,t) = \left(\frac{p(t+1)\big(t+p(t)\big)}{1+p(t+1)-p(t)} - \sum_{k=1}^{t} p(k)\right)c, \tag{9}$$

$$W(1,t) = W(0,t-1), \quad W(0,0) = 0. \tag{10}$$

Proof: We first prove the following lemma that establishes
a necessary and sufficient condition for strict indexability and
the associated Whittle index.

Lemma 2: Define $W(0,t)$ as in (9). RMAB-IDS is strictly
indexable if and only if $W(0,t)$ is strictly increasing in $t$.
In this case, the Whittle index of state $(i,t)$ ($i \in \{0,1\}$) is
given by (9) and (10).

Proof: Without loss of generality, we assume that the cost
$c = 1$. We first prove the necessity. If the bandit is strictly
indexable, the states $\{(0,t)\}_{t\ge 1}$ join the passive set one by
one as $\lambda$ continuously increases. From the proof of Theorem 1,
after observing component state 0, it is optimal to activate the
arm at the $t$-th step under subsidy $\lambda$ if and only if

$$\lambda \ge \frac{d(t,s)}{c(t,s)}, \quad \forall\, s < t, \tag{11}$$

$$\lambda \le \frac{d(u,t)}{c(u,t)}, \quad \forall\, u > t, \tag{12}$$

where

$$c(x,y) = x - y + p(x) - p(y),$$

$$d(x,y) = \sum_{k=1}^{x} p(k)\,\big(y+p(y)\big) - \sum_{k=1}^{y} p(k)\,\big(x+p(x)\big).$$

Consider an arbitrary $v \ge 1$. If both (11) and (12) hold with
equality by letting $(u,t,s) = (v+2,v+1,v)$ and $\lambda = W(0,v)$,
then the Whittle indexes for states $(0,v)$ and $(0,v+1)$ would be
the same. This contradicts the strict indexability. We thus have
that $d(v+1,v)/c(v+1,v)$ is strictly increasing in $v$.

Now we prove the sufficiency. Assume that $W(0,t)$ is
strictly increasing in $t$. This implies that $W(0,t)$ is positive
for all $t \ge 1$ since

$$W(0,1) = \frac{p(2) - p(1) + p^2(1)}{1 + p(2) - p(1)} > 0.$$

For an arbitrary $v \ge 1$, there must exist a subsidy $\lambda > 0$ such
that both (11) and (12) hold with strict inequality by letting
$(u,t,s) = (v+2,v+1,v)$. So the Whittle index for state
$(0,v)$ is smaller than this $\lambda$ while the Whittle index for state
$(0,v+1)$ is larger than it. This proves the strict indexability.

Under the strict indexability, if we set the subsidy $\lambda_t$ as the
Whittle index of state $(0,t)$, then it is optimal to either activate
at $(0,t)$ or wait one more step to activate at $(0,t+1)$. We
thus have

$$\lambda_t\, c(t+1,t) = d(t+1,t), \tag{13}$$

which leads to the Whittle index of state $(0,t)$ as given
in (9). Recall that for any nonnegative subsidy, the optimal
activation time after observing component state 1 is one step
later than that after observing component state 0. We thus
arrive at $W(1,t) = W(0,t-1)$ for $t \ge 2$. Based on the proof
of Theorem 1, it is not hard to see that $W(1,1) = 0$. We have thus
proved the lemma. ■

Based on Lemma 2, we only need to prove that C1 implies
that $W(0,t)$ is strictly increasing in $t$.
Equivalently, for any $t \ge 1$, we need to prove

$$\frac{d(t+2,t+1)}{c(t+2,t+1)} > \frac{d(t+1,t)}{c(t+1,t)}. \tag{14}$$

Define $\delta(t) = p(t+1) - p(t)$, which is positive under C1. By
simplifying (14), it is equivalent to prove

$$\begin{aligned} & p(t+1)\,t\,\delta(t) + p^2(t+1)\,\delta(t) + \delta(t)\,\delta(t+1) + p(t+1)\,t + p^2(t+1) + \delta(t+1)(t+1) \\ &\quad > p(t)\,t\,\delta(t+1) + p^2(t)\,\delta(t+1) + \delta(t)\,\delta(t+1)\,p(t) + p(t)\,t + p^2(t) + \delta(t)\,p(t) + \delta(t)\,t. \end{aligned} \tag{15}$$

Since $p(t)$ is increasing and $\delta(t)$ is strictly decreasing in $t$
(under C1), we have

$$p(t+1)\,t\,\delta(t) + p^2(t+1)\,\delta(t) + \delta(t)\,\delta(t+1) > p(t)\,t\,\delta(t+1) + p^2(t)\,\delta(t+1) + \delta(t)\,\delta(t+1)\,p(t). \tag{16}$$

To prove (15), it is sufficient to prove

$$p(t+1)\,t + p^2(t+1) + \delta(t+1)(t+1) > p(t)\,t + p^2(t) + \delta(t)\,p(t) + \delta(t)\,t. \tag{17}$$

After some simplifications of (17), we need to prove

$$\delta(t)\,p(t+1) + \delta(t+1)(t+1) > 0, \tag{18}$$

which is always true under C1. We have thus proved Theorem 2.
■
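The closed form in (9) and (10) is short enough to evaluate directly. The following sketch computes the index for the i.i.d. attack example, which satisfies C1, and tabulates $W(0,t)$ so that its strict increase in $t$ can be inspected. The function name, the list-based representation of $\{p(t)\}$, and the value $q = 0.3$ are assumptions made for illustration.

```python
from typing import List


def whittle_index(i: int, t: int, p: List[float], c: float = 1.0) -> float:
    """Whittle index W(i, t) from (9)-(10).

    p[k] holds p(k) with p[0] = 0; evaluating W(0, t) needs p[t + 1].
    """
    if i == 1:
        # (10): W(1, t) = W(0, t - 1), with W(0, 0) = 0.
        return whittle_index(0, t - 1, p, c)
    if t == 0:
        return 0.0
    delta = p[t + 1] - p[t]
    return c * (p[t + 1] * (t + p[t]) / (1.0 + delta) - sum(p[1:t + 1]))


# Hypothetical example: i.i.d. attacks with q = 0.3, so p(t) = 1 - 0.7**t.
q = 0.3
p = [1.0 - (1.0 - q) ** t for t in range(32)]
w0 = [whittle_index(0, t, p) for t in range(1, 31)]   # W(0, 1), ..., W(0, 30)
```

In this example the first entry agrees with the expression for $W(0,1)$ used in the proof of Lemma 2, namely $(p(2) - p(1) + p^2(1))/(1 + p(2) - p(1))$.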

The near-optimal performance of Whittle index policy is
observed through numerical examples (see Sec. VI). In Sec. V,
we show that when all components are homogeneous, Whittle
index policy is equivalent to the myopic policy and achieves
the optimal performance.

C. The Optimal Policy under a Relaxed Constraint 

In this subsection, we consider the scenario with a relaxed 
resource constraint, where we only require the average number 
of activated arms to be no more than K. This scenario 
often arises in systems where the resource constraint is
stricter on the average value than on the peak value, e.g.,
energy-saving systems. Under the relaxed constraint, the
indexability and the Whittle index lead to a simple optimal
policy for RMAB-IDS.

As explained by Whittle in [6], the subsidy $\lambda$ for passivity
is essentially the Lagrangian multiplier for the general RMAB
with the following relaxed constraint

$$\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} K(t) \le K, \tag{19}$$

where $K(t)$ is the number of activated arms at time $t$.
Specifically, the subsidy $\lambda$ controls the expected fraction of time,
i.e., the steady-state probability $\pi_n(\lambda)$, that arm $n$ ($1 \le n \le N$)
is made active under the corresponding single-arm optimal
policy. For RMAB-IDS, under the optimal subsidy $\lambda^*$, we
have

$$\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} K(t) = \sum_{n=1}^{N} \pi_n(\lambda^*) = K, \tag{20}$$

and (19) is satisfied with equality.

Given the optimal subsidy $\lambda^*$, the optimal policy under the
relaxed constraint is simply given by the composition of $N$
independent single-arm optimal policies (applied to the $N$
arms respectively) under the common subsidy $\lambda^*$. Specifically,
at each time, if the Whittle index of an arm is larger than
$\lambda^*$ then we activate the arm; otherwise we make the arm
passive. Note that if the Whittle index of an arm is equal
to $\lambda^*$, randomizing between the active and passive actions
would be necessary to satisfy (20), as detailed in [7]. Given
the closed-form Whittle index established in Theorem 2, it
remains to solve for the optimal subsidy $\lambda^*$. Note that based
on the Lagrangian multiplier theorem [6], we have

$$\lambda^* = \arg\min_{\lambda}\Big\{\sum_{n=1}^{N} g_n(\lambda) - (N-K)\lambda\Big\}, \tag{21}$$

where $g_n(\lambda)$ is the maximum average reward of arm $n$ under
the single-arm policy for subsidy $\lambda$ and is convex in $\lambda$. From
the closed-form Whittle index, it is not hard to solve for
the optimal stopping times $\{t_i^*(\lambda)\}_{i=0,1}$ (see (7)) and the
maximum average reward $g(\lambda)$ for each $\lambda$. We can then obtain
the optimal $\lambda^*$ from (21) by any classic algorithm for finding
the minimum of a convex function.
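One way to carry out the minimization in (21) is sketched below, using ternary search as one example of a classic method for convex functions. The sketch makes several assumptions that are not in the paper: the search is restricted to $\lambda \in [0, \lambda_{hi}]$ (so the minimizer is assumed nonnegative), $g_n(\lambda)$ is evaluated by maximizing the right-hand side of (6) with the cost term scaled by $c_n$, the waiting time is truncated at the length of the stored sequence, and the four example arms follow the i.i.d. attack model with hypothetical values of $q_n$.

```python
from typing import List, Tuple

Arm = Tuple[List[float], float]   # (p, c): p[k] holds p(k) with p[0] = 0, c is the cost


def single_arm_gain(lam: float, p: List[float], c: float) -> float:
    """g(lam): maximum over waiting times of the right-hand side of (6), for lam >= 0."""
    best, cum = float("-inf"), 0.0
    for t in range(1, len(p)):
        cum += p[t]                                   # running sum of p(1..t)
        best = max(best, (lam * (t - 1 + p[t]) - c * cum) / (t + p[t]))
    return best


def relaxed_objective(lam: float, arms: List[Arm], k: int) -> float:
    """Objective inside the arg min of (21)."""
    return sum(single_arm_gain(lam, p, c) for p, c in arms) - (len(arms) - k) * lam


def optimal_subsidy(arms: List[Arm], k: int, lam_hi: float, iters: int = 200) -> float:
    """Ternary search for a minimizer of the convex objective on [0, lam_hi]."""
    lo, hi = 0.0, lam_hi
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if relaxed_objective(m1, arms, k) <= relaxed_objective(m2, arms, k):
            hi = m2          # a minimizer lies in [lo, m2]
        else:
            lo = m1          # a minimizer lies in [m1, hi]
    return 0.5 * (lo + hi)


# Hypothetical heterogeneous arms with unit costs.
arms = [([1.0 - (1.0 - q) ** t for t in range(201)], 1.0) for q in (0.05, 0.1, 0.2, 0.3)]
lam_star = optimal_subsidy(arms, k=1, lam_hi=50.0)
```

Each $g_n$ computed this way is a maximum of functions that are affine in $\lambda$, so the objective is convex and ternary search applies; a subgradient or bisection method on the slope would serve equally well.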

V. Optimality in Homogeneous Networks 

In this section, we study the performance of Whittle index
policy in homogeneous networks, i.e., all components have the
same parameters: the probability sequence $\{p(t)\}_{t\ge 0}$ and the
per-unit-time cost $c$ for being abnormal.

We first establish the equivalence of Whittle index policy
with the myopic policy for homogeneous components. In
general, the myopic policy chooses the $K$ components
solely to minimize the expected cost in the next slot. It is not hard
to show that for homogeneous components, the myopic policy
reduces to choosing the $K$ components with the largest
probabilities of being in the abnormal state. The myopic action
$A(\cdot)$ as a function of the current states of all arms is thus given
below.

$$\begin{aligned} A(\{i_n,t_n\}_{n=1}^N) &= \arg\max_{A} \sum_{n:\,a_n=1} \Pr\big(S_n = 1 \mid (i_n,t_n)\big) \\ &= \arg\max_{A} \sum_{n:\,a_n=1} \big(p(t_n)(1-i_n) + p(t_n-1)\,i_n\big). \end{aligned} \tag{22}$$
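The myopic rule in (22) can be sketched as follows for homogeneous components. The function names and the sample sequence (the i.i.d. attack example with a hypothetical $q = 0.2$) are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

ArmState = Tuple[int, int]   # (last observed state i, elapsed time t)


def abnormal_probability(i: int, t: int, p: List[float]) -> float:
    """Pr(S = 1 | (i, t)) = p(t)(1 - i) + p(t - 1) i, with p[k] holding p(k)."""
    return p[t] if i == 0 else p[t - 1]


def myopic_action(arm_states: Sequence[ArmState], p: List[float], k: int) -> List[int]:
    """Positions of the k arms most likely to be abnormal, as in (22)."""
    order = sorted(range(len(arm_states)),
                   key=lambda n: abnormal_probability(*arm_states[n], p),
                   reverse=True)
    return sorted(order[:k])


# Hypothetical homogeneous example: p(t) = 1 - 0.8**t for every component.
p = [1.0 - 0.8 ** t for t in range(11)]
states = [(0, 1), (1, 1), (0, 4), (1, 4)]
probed = myopic_action(states, p, k=2)
```

In this example the arm at state (1, 1) has abnormal probability $p(0) = 0$, so it is the last one the myopic rule would probe.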



Lemma 3: For homogeneous components, Whittle index
policy is equivalent to the myopic policy and has the following
simple structure: initialize a queue in which components are
ordered in descending order of their initial
probabilities of being in the abnormal state. At each time, we
probe the $K$ components at the head of the queue. In the next
slot, these $K$ components are moved to the bottom of
the queue, with those observed in state 0 kept ahead of
those observed in state 1.

Proof: Based on the proof of Theorem Q] the Whittle 
index $W(i,t)$ of an arm is monotonically increasing with $t$ for
fixed $i \in \{0, 1\}$, and $W(1, t) = W(0, t - 1)$ with $W(0, 0) = 0$.
Based on the monotonically increasing property of $\{p(t)\}_{t\ge 0}$, it is
not hard to see that the Whittle index $W(i, t)$ is monotonically
increasing with $\Pr(S = 1 \mid (i,t))$. Whittle index policy is thus
equivalent to the myopic policy for homogeneous arms.

From the equivalence of Whittle index policy with the
myopic policy, its structure is straightforward: based on
the current observations, all components observed in state
1 will have zero probability of being abnormal in the next slot,
those observed in state 0 will have the second smallest probability
$p(1)$ of being abnormal, and the unobserved arms will
keep the same relative rank in the probability of being abnormal in
the next slot, by the monotonicity of $\{p(t)\}_{t\ge 0}$. ■

From Lemma 3, Whittle index policy can be implemented
without knowing the system parameters $\{p(t)\}_{t\ge 0}$ and $c$.
Furthermore, Whittle index policy is optimal, as given in the
following theorem.

Theorem 3: For homogeneous components, Whittle index 
policy minimizes the expected total cost over a finite time 
horizon of an arbitrary length T (T > 1). It is thus also 
optimal under the strong average-reward criterion over the 
infinite time horizon. 

Proof: We prove the theorem by backward
induction on the time horizon. Any policy, including Whittle
index policy, is optimal at the last time instant $t = T$ since
the current action affects only the future cost but not the
immediate cost. Now assume that Whittle index policy is
optimal at time instants $t+1, t+2, \dots, T$. We need to prove
that it is optimal at time $t$. Without loss of generality, we set
$c = 1$. Let $\Omega(t) = (\omega_1, \omega_2, \dots, \omega_N)$ with $\omega_n \in \{p(t)\}_{t\ge 0}$
denote an unordered set consisting of the probabilities that the
$N$ components are in state 1 at time $t$. Define the value
function $V_t(\Omega(t))$ of Whittle index policy as the expected total
cost from time $t$ up to $T$. Consider a policy that activates
the $K$ components with probabilities $(\omega_1, \omega_2, \dots, \omega_K)$ of
being in state 1 at time $t$ and follows Whittle index policy
in the future time instants up to time $T$. The value function
$\hat{V}_t(\omega_1, \omega_2, \dots, \omega_N)$, i.e., the expected total cost from time $t$
up to $T$, of this policy is given by

$$\hat{V}_t(\omega_1, \omega_2, \dots, \omega_N) = \sum_{k=1}^{N} \omega_k + E\big[ V_{t+1}(\underbrace{0, \dots, 0}_{k_1 \text{ times}}, \underbrace{p(1), \dots, p(1)}_{k_0 \text{ times}}, \tau(\omega_{K+1}), \dots, \tau(\omega_N)) \big],$$

where the expectation is taken over the random variables
$\{k_i\}_{i=0,1}$ ($k_1 + k_0 = K$), with $k_1$ and $k_0$ denoting the numbers
of components observed in state 1 and state 0, respectively, and $\tau(\cdot)$
denotes the one-step update of the abnormal probability for
unobserved components based on $\{p(t)\}_{t\ge 0}$, i.e., $\tau(p(t)) = p(t+1)$. Note that if
$\omega_1 \ge \omega_2 \ge \dots \ge \omega_N$, then $\hat{V}_t = V_t$.

To prove that Whittle index policy, i.e., the myopic policy,
is optimal at time $t$, it is sufficient to prove that for any $y >
x$, $x, y \in \{p(t)\}_{t\ge 0}$,

$$\hat{V}_t(\omega_1, \dots, y, \dots, x, \dots, \omega_N) \le \hat{V}_t(\omega_1, \dots, x, \dots, y, \dots, \omega_N). \tag{23}$$

This means that a component with a higher probability of being
in state 1 should be given a higher priority. To show (23),
we first present the following lemma that establishes the
monotonicity of the value function of Whittle index policy.

Lemma 4: The value function $V_t(\omega_1, \omega_2, \dots, \omega_N)$ of Whittle
index policy is an increasing function of each entry $\omega_n$ ($n \in
\{1, 2, \dots, N\}$).

Proof: Without loss of generality, we assume that all
probabilities within $V_t(\cdot)$ are in descending order. The proof
is based on backward induction on time $t$. If $t = T$,
the claim is clearly true. Assume that the lemma holds for
$s = t+1, t+2, \dots, T$. Consider time $t$. We need to show

$$V_t(\vec{\omega}_1, y, \vec{\omega}_2) \ge V_t(\vec{\omega}_1, x, \vec{\omega}_2), \quad \forall\, y > x,\ x, y \in \{p(t)\}_{t\ge 0}, \tag{24}$$

where $\vec{\omega}_1, \vec{\omega}_2$ are arbitrary (possibly empty) probability vectors
with $|\vec{\omega}_1| + |\vec{\omega}_2| = N - 1$.

Define $t_1 \ge 1$ as the first stopping time at which the component
denoted by $y/x$ in (24) is probed under Whittle index
policy. Based on the structure of Whittle index policy, $t_1$ is
deterministic. We have

$$\begin{aligned} V_t(\vec{\omega}_1, y, \vec{\omega}_2) ={}& A(\vec{\omega}_1, \vec{\omega}_2) + \sum_{k=1}^{t_1} \tau^{k-1}(y) \\ &+ E\big[ \tau^{t_1-1}(y)\, V_{t+t_1}(\vec{\omega}_1', 0, \vec{\omega}_2') + (1 - \tau^{t_1-1}(y))\, V_{t+t_1}(\vec{\omega}_1', p(1), \vec{\omega}_2') \big], \end{aligned} \tag{25}$$

$$\begin{aligned} V_t(\vec{\omega}_1, x, \vec{\omega}_2) ={}& A(\vec{\omega}_1, \vec{\omega}_2) + \sum_{k=1}^{t_1} \tau^{k-1}(x) \\ &+ E\big[ \tau^{t_1-1}(x)\, V_{t+t_1}(\vec{\omega}_1', 0, \vec{\omega}_2') + (1 - \tau^{t_1-1}(x))\, V_{t+t_1}(\vec{\omega}_1', p(1), \vec{\omega}_2') \big], \end{aligned} \tag{26}$$

where $A(\vec{\omega}_1, \vec{\omega}_2)$ is the expected total cost up to $t_1$
determined by components other than that denoted by $y/x$, the vectors
$(\vec{\omega}_1', \vec{\omega}_2')$ are stochastically determined by $\vec{\omega}_1, \vec{\omega}_2$ based on the
observations between time $t$ and $t + t_1 - 1$, and $\tau^k(\cdot)$ denotes
the $k$-th iteration of the operator $\tau(\cdot)$. We point out that based on
the structure of Whittle index policy, the total cost $A(\vec{\omega}_1, \vec{\omega}_2)$
does not depend on the state of the component denoted by
$y/x$. From (25) and (26), we have that (24) holds if

$$\sum_{k=1}^{t_1 - 1} \big( \tau^{k-1}(y) - \tau^{k-1}(x) \big) + \big( \tau^{t_1-1}(y) - \tau^{t_1-1}(x) \big) E\big[ 1 + V_{t+t_1}(\vec{\omega}_1', 0, \vec{\omega}_2') - V_{t+t_1}(\vec{\omega}_1', p(1), \vec{\omega}_2') \big] \ge 0. \tag{27}$$

From the monotonically increasing property of $\{p(t)\}_{t\ge 0}$, we have

$$\tau^k(y) - \tau^k(x) \ge 0, \quad \forall\, y > x,\ x, y \in \{p(t)\}_{t\ge 0},\ k \ge 0.$$

To show (27), it is sufficient to show

$$E\big[ 1 + V_{t+t_1}(\vec{\omega}_1', 0, \vec{\omega}_2') - V_{t+t_1}(\vec{\omega}_1', p(1), \vec{\omega}_2') \big] \ge 0. \tag{28}$$

Starting from time $t + t_1$, define $t_2$ as the first stopping time
at which the component denoted by $0/p(1)$ is probed under Whittle
index policy. Between time $t + t_1$ and $t + t_1 + t_2$, the difference
in the expected total cost incurred by this component when its
abnormal probabilities are respectively given by 0 and $p(1)$ is
equal to $p(t_2)$. This is because the update of the abnormal
probability when starting from 0 lags one step behind that
from $p(1)$. Again, based on the structure of Whittle index
policy, the expected total cost incurred by other components
is independent of the state of this component. By expanding
the value function in (28) at time $t + t_1 + t_2$ and after some
simplifications, it is equivalent to show

$$E\Big[ 1 - p(t_2) + \big( p(t_2 - 1) - p(t_2) \big) E\big[ V_{t+t_1+t_2}(\vec{\omega}_1'', 0, \vec{\omega}_2'') - V_{t+t_1+t_2}(\vec{\omega}_1'', p(1), \vec{\omega}_2'') \big] \Big] \ge 0, \tag{29}$$

where the vectors $(\vec{\omega}_1'', \vec{\omega}_2'')$ are stochastically determined by $\vec{\omega}_1', \vec{\omega}_2'$
based on observations between time $t + t_1$ and $t + t_1 + t_2 - 1$.
By induction, for any $\vec{\omega}_1'', \vec{\omega}_2''$,

$$V_{t+t_1+t_2}(\vec{\omega}_1'', 0, \vec{\omega}_2'') - V_{t+t_1+t_2}(\vec{\omega}_1'', p(1), \vec{\omega}_2'') \le 0.$$

Since $p(t_2 - 1) - p(t_2) \le 0$ and $1 - p(t_2) \ge 0$, it is not hard to see
that (29) holds. Note that for the
realizations of $t_1$ and $t_2$ such that $t + t_1 > T$ and/or $t + t_1 +
t_2 > T$, the monotonicity of the conditional value function is
straightforward to prove. We have thus proved the lemma. ■
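Now we are ready to prove (23). If the positions of $y$ and $x$
are both in the top $K$ or both after the top $K$, then the inequality
holds with equality. Consider the case that $y$ is in the top $K$ but
$x$ is not. We have, for any probability vectors $\{\vec{\omega}_i\}_{i=1,2,3}$ (the
immediate cost at time $t$ is identical under both orderings and is omitted),

$$\begin{aligned} \hat{V}_t(\vec{\omega}_1, y, \vec{\omega}_2, x, \vec{\omega}_3) &= E\big[ y\, V_{t+1}(\vec{\omega}_1', \vec{\omega}_2', \tau(x), \vec{\omega}_3', 0) + (1 - y)\, V_{t+1}(\vec{\omega}_1', \vec{\omega}_2', \tau(x), \vec{\omega}_3', p(1)) \big] \\ &\le E\big[ y\, V_{t+1}(\vec{\omega}_1', \vec{\omega}_2', \tau(y), \vec{\omega}_3', 0) + (1 - y)\, V_{t+1}(\vec{\omega}_1', \vec{\omega}_2', \tau(y), \vec{\omega}_3', p(1)) \big] \\ &\le E\big[ x\, V_{t+1}(\vec{\omega}_1', \vec{\omega}_2', \tau(y), \vec{\omega}_3', 0) + (1 - x)\, V_{t+1}(\vec{\omega}_1', \vec{\omega}_2', \tau(y), \vec{\omega}_3', p(1)) \big] \\ &= \hat{V}_t(\vec{\omega}_1, x, \vec{\omega}_2, y, \vec{\omega}_3), \end{aligned}$$

where $\hat{V}_t$ denotes, as in (23), the value function of the policy that
probes the first $K$ entries at time $t$ and follows Whittle index policy
afterwards, $\vec{\omega}_1', \vec{\omega}_2', \vec{\omega}_3'$ are stochastically determined by $\vec{\omega}_1, \vec{\omega}_2, \vec{\omega}_3$
based on the observation at time $t$, and the two inequalities
are due to Lemma 4. We have thus proved the optimality of Whittle
index policy over a finite horizon of an arbitrary length $T$. By
contradiction, if Whittle index policy were not optimal under the
strong average-reward criterion, there would exist a $T_0$ such that
Whittle index policy performs worse than the optimal policy
over the horizon of length $T_0$, which contradicts the finite-horizon
optimality just established. Consequently, Whittle index
policy is also optimal under the strong average-reward criterion
over the infinite time horizon. ■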
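
The finite-horizon claim can be checked numerically on a tiny instance by comparing the myopic policy against exhaustive dynamic programming. The sketch below is an illustration under stated assumptions, not the authors' code: each component is summarized by an integer "age" $a$ meaning it is abnormal with probability $p(a)$, the sequence `P` is taken from the Fig. 2 caption, ages beyond the table are clamped to the last value, and the names `value` and `_expected_next` are hypothetical.

```python
from functools import lru_cache
from itertools import combinations, product

# p(0), ..., p(8) from the Fig. 2 caption (homogeneous components, c = 1).
P = (0.0, 0.5, 0.7, 0.85, 0.95, 0.97, 0.975, 0.978, 0.98)


def p(age):
    """Abnormal probability for a given age (assumption: clamp beyond the table)."""
    return P[min(age, len(P) - 1)]


@lru_cache(maxsize=None)
def value(t, ages, T, K, optimal):
    """Expected total cost from slot t to T; ages is a tuple sorted ascending.

    optimal=True minimizes over every K-subset; optimal=False probes the K
    components with the largest abnormal probability (the myopic rule).
    """
    immediate = sum(p(a) for a in ages)
    if t == T:  # the last action cannot change any counted cost
        return immediate
    n = len(ages)
    if optimal:
        actions = list(combinations(range(n), K))
    else:
        actions = [tuple(range(n - K, n))]  # largest ages = largest probabilities
    return immediate + min(_expected_next(t, ages, a, T, K, optimal) for a in actions)


def _expected_next(t, ages, probed, T, K, optimal):
    """Average the next-slot value over the observations of the probed components."""
    total = 0.0
    for outcome in product((0, 1), repeat=len(probed)):  # 1 = observed abnormal
        prob = 1.0
        nxt = [a + 1 for a in ages]  # unobserved: tau(p(a)) = p(a + 1)
        for idx, obs in zip(probed, outcome):
            w = p(ages[idx])
            prob *= w if obs else 1.0 - w
            nxt[idx] = 0 if obs else 1  # fixed -> p(0) = 0; seen healthy -> p(1)
        if prob > 0.0:
            total += prob * value(t + 1, tuple(sorted(nxt)), T, K, optimal)
    return total


if __name__ == "__main__":
    start = (1, 2, 3, 4)
    print(value(1, start, 5, 2, True), value(1, start, 5, 2, False))
```

The exhaustive search is exponential in the number of components, so it serves only as a sanity check on small instances.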

VI. Numerical Examples 

In this section, we present some numerical examples and 
evaluate the performance of Whittle index policy for nonho- 
mogeneous components. 

In Fig. 2 we illustrate the Whittle index as a function of
the arm state. The monotonicity and concavity of the Whittle
index are observed. In Fig. 3 we compare the performance
of Whittle index policy versus the optimal policy. Due to the 
complexity of the dynamic programming problem given in (fl]i, 
we only computed the optimal cost over a short time horizon. 
Note that the cost under the non-stationary optimal policy over 
a finite time horizon is a lower bound on that achieved by 
the stationary optimal policy over the infinite time horizon. 
We observe that Whittle index policy achieves a near-optimal 
performance. 

In Fig. 4 we compare Whittle index policy with the
myopic policy over a long time horizon. We observe that for
nonhomogeneous components, Whittle index policy outperforms
the myopic policy, and the performance improvement becomes
significant as time goes on.

Fig. 2. The Whittle index ($\{p(t)\}_{0\le t\le 8} = [0, 0.5, 0.7, 0.85, 0.95, 0.97, 0.975, 0.978, 0.98]$, $c = 1$).

Fig. 3. The near-optimality of Whittle index policy
($K = 1$, $N = 4$, $\{p_n(t)\}_{n=1,2,\dots,4,\ 0\le t\le 6} =
[0, .5, .7, .85, .95, .97, .975; 0, .3, .4, .48, .54, .57, .59;
0, .36, .46, .5, .53, .55, .56; 0, .6, .78, .9, .96, .98, .99]$, $\{c_n\}_{n=1,2,\dots,4} =
[.8, 1, 1.2, .9]$, all components start from the healthy state).

Numerical results similar to the above have been observed 
through extensive examples with randomly generated system 
parameters. 

VII. Applications to Queuing Networks 

Another application of the RMAB model considered in this
paper is holding cost minimization in queuing networks.
Consider a queuing network where customers randomly arrive
at $K$ servers. As shown in Fig. 5, all servers share a set of
$N$ finite-size buffers (for $N$ different classes of customers)
that are either empty or full based on the batch arrivals. We
assume that new customers of a class do not arrive if the
corresponding buffer is full. At each time, each server chooses
one buffer to serve and clears its packets. The objective is
to minimize the holding cost (e.g., delay) of the customers.
By likening a customer arrival to an attack, it is not hard
to see that the problem can be modeled as the RMAB at
hand under certain conditions, e.g., when the arrival process
of each class is i.i.d. or Markovian over time (given that the buffer
is empty). Such a queuing network often arises in backorder
control systems and peer-to-peer communication networks.
For example, in a backorder control system, random orders
for $N$ commodities arrive at a seller and the seller needs
to decide which $K$ commodities to check and process the
corresponding orders at a given time. For each commodity
and at each time, a backorder incurs a cost depending on the
level of urgency and/or value of the order. In a peer-to-peer
communication network, there are $N$ communication links
where each link has a pair of nodes exchanging messages. At
each time, only $K$ links can be turned on for communications
and the cost can be modeled as the delay of each message.

Fig. 4. The performance of Whittle index policy versus the
myopic policy ($K = 2$, $N = 8$, Markovian state model,
$\{q_n\}_{n=1,2,\dots,8} = [.2, .3, .3, .5, .6, .7, .7, .8]$, $\{c_n\}_{n=1,2,\dots,8} =
[2.5, 2, 1.8, 1.5, 1.2, 1, .6, .5]$, all components start from the healthy state).

Fig. 5. The equivalent queuing model of RMAB-IDS.
A potential future direction is to study the case in which the
buffer can be partially full and new arrivals come regardless of
the state of the buffer. The joint minimization of the holding
cost and the customer loss cost can be considered. Such a
scenario is essentially a generalized version of the RMAB
with stochastically time-varying instantaneous cost $c_n(t)$. It is
also interesting to extend the RMAB to partial reset models
for handling more general customer arrival processes.
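
To make the mapping to the RMAB concrete for the i.i.d. case mentioned above, the sketch below computes the probability sequence $\{p(t)\}$ for one buffer. It assumes, for illustration only, that while the buffer is empty a batch arrives in each slot independently with probability `q` and that an arrival fills the buffer, so that the buffer is full $t$ slots after being cleared with probability $1 - (1 - q)^t$. The function name and the parameter `q` are hypothetical.

```python
def iid_full_probabilities(q, horizon):
    """p(t) = 1 - (1 - q)**t for t = 0, ..., horizon (buffer cleared at t = 0)."""
    return [1.0 - (1.0 - q) ** t for t in range(horizon + 1)]


if __name__ == "__main__":
    print(iid_full_probabilities(0.5, 3))
```

The resulting sequence starts at $p(0) = 0$ and is monotonically increasing, which matches the properties of $\{p(t)\}$ used in the homogeneous analysis.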

VIII. Conclusion 

In this paper, we studied the intrusion detection problem in
large cyber networks under general attack processes. By adopting
a reset model of the network dynamics, we formulated the
problem as a class of RMAB under a strong average-reward
criterion. We showed that this class of RMAB is indexable and
that the Whittle index can be obtained in closed form. This result leads
to a low-complexity implementation of Whittle index policy
that achieves a near-optimal performance. We further showed
that for homogeneous components, Whittle index policy can
be implemented without knowing the system parameters and
is optimal over both finite and infinite time horizons.